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Abstract 

We  had  previously  shown  that  regularization  principles  lead  to  approximation  schemes  which  are  equivalent 
to  networks  with  one  layer  of  hidden  units,  called  Regularization  Networks.  In  particular  we  had  discussed 
how  standard  smoothness  functionals  lead  to  a  subclass  of  regularization  networks,  the  well-known  Radial 
Basis  Functions  approximation  schemes.  In  this  paper  we  show  that  regularization  networks  encompass 
a  much  broader  range  of  approximation  schemes,  including  many  of  the  popular  general  additive  models 
and  some  of  the  neural  networks.  In  particular  we  introduce  new  classes  of  smoothness  functionals  that 
lead  to  different  classes  of  basis  functions.  Additive  splines  as  well  as  some  tensor  product  splines  can 
be  obtained  from  appropriate  classes  of  smoothness  functionals.  Furthermore,  the  same  extension  that 
extends  Radial  Basis  Functions  (RBF)  to  Hyper  Basis  Functions  (HBF)  leads  from  additive  models  to  ridge 
approximation  models,  containing  as  special  cases  Breiman’s  hinge  functions  and  some  forms  of  Projection 
Pursuit  Regression.  We  propose  to  use  the  term  Generalized  Regularization  Networks  for  this  broad  class  of 
approximation  schemes  that  follow  from  an  extension  of  regularization.  In  the  probabilistic  interpretation 
of  regularization,  the  different  classes  of  basis  functions  correspond  to  different  classes  of  prior  probabilities 
on  the  approximating  function  spaces,  that  is  to  different  types  of  smoothness  assumptions.  In  the  final 
part  of  the  paper,  we  show  the  relation  between  activation  functions  of  the  Gaussian  and  sigmoidal  type 
by  considering  the  simple  case  of  the  kernel  G(x)  =  x  . 

In  summary,  different  multilayer  networks  with  one  hidden  layer,  which  we  collectively  call  Generalized 
Regularization  Networks,  correspond  to  different  classes  of  priors  and  associated  smoothness  functionals 
in  a  classical  regularization  principle.  Three  broad  classes  are  a)  Radial  Basis  Functions  that  generalize 
into  Hyper  Basis  Functions,  b)  some  tensor  product  splines,  and  c)  additive  splines  that  generalize  into 
schemes  of  the  type  of  ridge  approximation,  hinge  functions  and  one-hidden-layer  perceptrons. 
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1  Introduction 

In  recent  papers  we  and  others  have  argued  that  the 
task  of  learning  from  examples  can  be  considered  in 
many  cases  to  be  equivalent  to  multivariate  function  ap¬ 
proximation,  that  is,  to  the  problem  of  approximating  a 
smooth  function  from  sparse  data,  the  examples.  The 
interpretation  of  an  approximation  scheme  in  terms  of 
networks,  and  viceversa,  has  also  been  extensively  dis¬ 
cussed  (Barron  and  Barron,  1988;  Poggio  and  Girosi, 
1989,  1990;  Broomhead  and  Lowe,  1988). 

In  a  series  of  papers  we  have  explored  a  specific,  al¬ 
beit  quite  general,  approach  to  the  problem  of  function 
approximation.  The  approach  is  based  on  the  recogni¬ 
tion  that  the  ill-posed  problem  of  function  approxima¬ 
tion  from  sparse  data  must  be  constrained  by  assum¬ 
ing  an  appropriate  prior  on  the  class  of  approximating 
functions.  Regularization  techniques  typically  impose 
smoothness  constraints  on  the  approximating  set  of  func¬ 
tions.  It  can  be  argued  that  some  form  of  smoothness 
is  necessary  to  allow  meaningful  generalization  in  ap¬ 
proximation  type  problems  (Poggio  and  Girosi,  1989, 
1990).  A  similar  argument  can  also  be  used  in  the  case 
of  classification  where  smoothness  involves  the  classifica¬ 
tion  boundaries  rather  than  the  input-output  mapping 
itself.  Our  use  of  regularization,  which  follows  the  clas¬ 
sical  technique  introduced  by  Tikhonov  (1963,  1977), 
identifies  the  approximating  function  as  the  minimizer 
of  a  cost  functional  that  includes  an  error  term  and  a 
smoothness  functional,  usually  called  a  stabilizer.  In  the 
Bayesian  interpretation  of  regularization  the  stabilizer 
corresponds  to  a  smoothness  prior,  and  the  error  term 
to  a  model  of  the  noise  in  the  data  (usually  Gaussian 
and  additive). 

In  Poggio  and  Girosi  (1989)  we  showed  that  regular¬ 
ization  principles  lead  to  approximation  schemes  which 
are  equivalent  to  networks  with  one  “hidden”  layer, 
which  we  call  Regularization  Networks  (RN).  In  par¬ 
ticular,  we  described  how  a  certain  class  of  radial  sta¬ 
bilizers  -  and  the  associated  priors  in  the  equivalent 
Bayesian  formulation  -  lead  to  a  subclass  of  regular¬ 
ization  networks,  the  already-known  Radial  Basis  Func¬ 
tions  (Powell,  1987,  1990;  Micchelli,  1986;  Dyn,  1987) 
that  we  have  extended  to  Hyper  Basis  Functions  (Poggio 
and  Girosi,  1990,  1990a).  The  regularization  networks 
with  radial  stabilizers  we  studied  include  all  the  classi¬ 
cal  one-dimensional  as  well  as  multidimensional  splines 
and  approximation  techniques,  such  as  radial  and  non- 
radial  Gaussian  or  multiquadric  functions.  In  Poggio  and 
Girosi  (1990,  1990a)  we  have  extended  this  class  of  net¬ 
works  to  Hyper  Basis  Functions  (HBF).  In  this  paper  we 
show  that  an  extension  of  Regularization  Networks,  that 
we  propose  to  call  Generalized  Regularization  Networks 
(GRN),  encompasses  an  even  broader  range  of  approx¬ 
imation  schemes,  including,  in  addition  to  HBF,  tensor 
product  splines,  many  of  the  general  additive  models, 
and  some  of  the  neural  networks. 

The  plan  of  the  paper  is  as  follows.  We  first  discuss 
the  solution  of  the  variational  problems  of  regularization 
in  a  rather  general  form.  We  then  introduce  three  differ¬ 
ent  classes  of  stabilizers  -  and  the  corresponding  priors 
in  the  equivalent  Bayesian  interpretation  -  that  lead  to 


different  classes  of  basis  functions:  the  well-know  radial 
stabilizers,  tensor-product  stabilizers,  and  the  new  addi¬ 
tive  stabilizers  that  underlie  additive  splines  of  different 
types.  It  is  then  possible  to  show  that  the  same  ex¬ 
tension  that  extends  Radial  Basis  Functions  to  Hyper 
Basis  Functions  leads  from  additive  models  to  ridge  ap¬ 
proximation,  containing  as  special  cases  Breiman’s  hinge 
functions  (1992)  and  ridge  approximations  of  the  type 
of  Projection  Pursuit  Regression  (PPR)  (Friedman  and 
Stuezle,  1981;  Huber,  1985).  Simple  numerical  exper¬ 
iments  are  then  described  to  illustrate  the  theoretical 
arguments. 

In  summary,  the  chain  of  our  arguments  shows  that 
ridge  approximation  schemes  such  as 

d‘ 

/w  =  ]C  • x)  • 

i=l 

where 

M»)  =  E  «£<?(»-<£) 

a  =  l 

are  approximations  of  Regularization  Networks  with  ap¬ 
propriate  additive  stabilizers.  The  form  of  G  depends 
on  the  stabilizer,  and  includes  in  particular  cubic  splines 
(used  in  typical  implementations  of  PPR)  and  one¬ 
dimensional  Gaussians.  It  seems,  however,  impossible 
to  directly  derive  from  regularization  principles  the  sig¬ 
moidal  activation  functions  used  in  Multilayer  Percep- 
trons.  We  discuss  in  a  simple  example  the  close  relation¬ 
ship  between  basis  functions  of  the  hinge,  the  sigmoid 
and  the  Gaussian  type. 

The  appendices  deal  with  observations  related  to  the 
main  results  of  the  paper  and  more  technical  details. 

2  The  regularization  approach  to  the 
approximation  problem 

Suppose  that  the  set  g  =  {(x*,  Vi)  G  Rd  x  °f  data 

has  been  obtained  by  random  sampling  of  a  function  /, 
belonging  to  some  space  of  functions  X  defined  on  Rd, 
in  the  presence  of  noise,  and  suppose  we  are  interested 
in  recovering  the  function  /,  or  an  estimate  of  it,  from 
the  set  of  data  g.  This  problem  is  clearly  ill-posed,  since 
it  has  an  infinite  number  of  solutions.  In  order  to  choose 
one  particular  solution  we  need  to  have  some  a  priori 
knowedge  of  the  function  that  has  to  be  reconstructed. 
The  most  common  form  of  a  priori  knowledge  consists  in 
assuming  that  the  function  is  smooth,  in  the  sense  that 
two  similar  inputs  correspond  to  two  similar  outputs. 
The  main  idea  underlying  regularization  theory  is  that 
the  solution  of  an  ill-posed  problem  can  be  obtained  from 
a  variational  principle,  which  contains  both  the  data  and 
prior  smoothness  information.  Smoothness  is  taken  into 
account  by  defining  a  smoothness  functional  <f>[f  ]  in  such 
a  way  that  lower  values  of  the  functional  correspond  to 
smoother  functions.  Since  we  look  for  a  function  that 
is  simultaneously  close  to  the  data  and  also  smooth,  it 
is  natural  to  choose  as  a  solution  of  the  approximation 
problem  the  function  that  minimizes  the  following  func¬ 
tional: 


N 

H[f}  =  X^(Xl')  ~  Vi)2  +  X^f]  ■  (X) 

t=l 

where  A  is  a  positive  number  that  is  usually  called  the 
regularization  parameter.  The  first  term  is  enforcing 
closeness  to  the  data,  and  the  second  smoothness,  while 
the  regularization  parameter  controls  the  tradeoff  be¬ 
tween  these  two  terms. 

It  can  be  shown  that,  for  a  wide  class  of  functionals  <j>, 
the  solutions  of  the  minimization  of  the  functional  (1)  all 
have  the  same  form.  Although  a  detailed  and  rigorous 
derivation  of  the  solution  of  this  problem  is  out  of  the 
scope  of  this  memo,  a  simple  derivation  of  this  general 
result  is  presented  in  appendix  (A).  In  this  section  we 
just  present  a  family  of  smoothness  functionals  and  the 
corresponding  solutions  of  the  variational  problem.  We 
refer  the  reader  to  the  current  literature  for  the  mathe¬ 
matical  details  (Wahba,  1990;  Madych  and  Nelson,  1990; 
Dyn,  1987). 

We  first  need  to  give  a  more  precise  definition  of 
what  we  mean  by  smoothness  and  define  a  class  of  suit¬ 
able  smoothness  functionals.  We  refer  to  smoothness  as 
a  measure  of  the  "oscillatory”  behavior  of  a  function. 
Therefore,  within  a  class  of  differentiable  functions,  one 
function  will  be  said  to  be  smoother  than  another  one  if 
it  oscillates  less.  If  we  look  at  the  functions  in  the  fre¬ 
quency  domain,  we  may  say  that  a  function  is  smoother 
than  another  one  if  it  has  less  energy  at  high  frequency 
(smaller  bandwidth).  The  high  frequency  content  of  a 
function  can  be  measured  by  first  high-pass  filtering  the 
function,  and  then  measuring  the  power,  that  is  the  L2 
norm,  of  the  result.  In  formulas,  this  suggests  defining 
smoothness  functionals  of  the  form: 


where  "indicates  the  Fourier  transform,  G  is  some  posi¬ 
tive  function  that  falls  off  to  zero  as  ||s||  — >  oo  (so  that  A 
is  an  high-pass  filter)  and  for  which  the  class  of  functions 
such  that  this  expression  is  well  defined  is  not  empty.  For 
a  well  defined  class  of  functions  G  (Madych  and  Nelson, 
1990;  Dyn,  1991)  this  functional  is  a  semi-norm,  with  a 
finite  dimensional  null  space  Af .  The  next  section  will 
be  devoted  to  giving  examples  of  the  possible  choices  for 
the  stabilizer  <f>.  For  the  moment  we  just  assume  that  it 
can  be  written  as  in  eq.  (2),  and  make  the  additional  as¬ 
sumption  that  G  is  symmetric,  so  that  its  Fourier  trans¬ 
form  G  is  real  and  symmetric.  In  this  case  it  is  possible 
to  show  (see  appendix  (A)  for  a  sketch  of  the  proof) 
that  the  function  that  minimizes  the  functional  (1)  has 
the  form: 

N  k 

/(x)  =  ]CCiG(X-  (3) 

i  —  1  a  =  1 

where  =1  is  a  basis  in  the  k  dimensional  null  space 

Af  and  the  coefficients  da  and  c*  satisfy  the  following 
linear  system: 


(G+  \I)c  +  VTd  =  y 

IPc  =  0 

where  I  is  the  identity  matrix,  and  we  have  defined 

(y )i  =  Vi  ,  (c)i  =  Ci  ,  (d)j  =  di  , 

(G)ij  =  G(Xi  -  Xj)  ,  (V)ai  =  Ipa(Xi) 

The  existence  of  a  solution  to  the  linear  system  shown 
above  is  guaranteed  by  the  existence  of  the  solution  of 
the  variational  problem.  The  case  of  A  =  0  corresponds 
to  pure  interpolation,  and  in  this  case  the  solvability  of 
the  linear  system  depends  on  the  properties  of  the  basis 
function  G. 

The  approximation  scheme  ofeq.  form  (3)  has  a  sim¬ 
ple  interpretation  in  terms  of  a  network  with  one  layer 
of  hidden  units,  which  we  call  a  Regularization  Network 
(RN).  Appendix  B  describes  the  simple  extension  to  vec¬ 
tor  output  scheme. 

3  Classes  of  stabilizers 

In  the  previous  section  we  considered  the  class  of  stabi¬ 
lizers  of  the  form: 


and  we  have  seen  that  the  solution  of  the  minimization 
problem  always  has  the  same  form.  In  this  section  we 
discuss  three  different  types  of  stabilizers  belonging  to 
the  class  (4),  corresponding  to  different  properties  of  the 
basis  functions  G.  Each  of  them  corresponds  to  differ¬ 
ent  a  priori  assumptions  about  the  smoothness  of  the 
function  that  must  be  approximated. 

3.1  Radial  stabilizers 

Most  of  the  commonly  used  stabilizers  have  radial  sim- 
metry,  that  is,  they  satisfy  the  following  equation: 

0[/(x)]  =  <t>[f{Rx)} 

for  any  rotation  matrix  R.  This  choice  reflects  the  a 
priori  assumption  that  all  the  variables  have  the  same 
relevance,  and  that  there  are  no  priviliged  directions. 
Rotation  invariant  stabilizers  correspond  clearly  to  ra¬ 
dial  basis  function  G(||x||).  Much  attention  has  been 
dedicated  to  this  case,  and  the  corresponding  approx¬ 
imation  technique  is  known  as  Radial  Basis  Functions 
(Micchelli,  1986;  Powell,  1987).  The  class  of  admissible 
Radial  Basis  Functions  is  the  class  of  conditionally  pos¬ 
itive  definite  functions  of  any  order,  since  it  has  been 
shown  (Madych  and  Nelson,  1991;  Dyn,  1991)  that  in 
this  case  the  functional  of  eq.  (4)  is  a  semi-norm,  and 
the  associated  variational  problem  is  well  defined.  All 
the  Radial  Basis  Functions  can  therefore  be  derived  in 
this  framework.  We  explicitly  give  two  important  exam¬ 
ples. 

Duchon  multivariate  splines 


Duchon  (1977)  considered  measures  of  smoothness  of  the 
form 


m  =  [  ds  iis|i2mi/(s)i2 . 

JRd 

In  this  case  G( s)  =  ppsr  and  the  corresponding  basis 
function  is  therefore 

.  _  j  ||x||2m_<iln  ||x||  if  2m  >  d  and  d  is  even 
*X'  —  (  ||x||2m_<J  otherwise. 

(5) 

In  this  case  the  null  space  of  <f>[f)  is  the  vector  space 
of  polynomials  of  degree  at  most  m  in  d  variables,  whose 
dimension  is 


These  basis  functions  are  radial  and  conditionally  pos¬ 
itive  definite,  so  that  they  represent  just  particular  in¬ 
stances  of  the  well  known  Radial  Basis  Functions  tech¬ 
nique  (Micchelli,  1986;  Wahba,  1990).  In  two  dimen¬ 
sions,  for  m  =  2,  eq.  (5)  yields  the  so  called  “thin 
plate”  basis  function  G(x)  =  ||x||2  In  ||x||  (Harder  and 
Desmarais,  1972),  depicted  in  figure  (1). 

The  Gaussian 

A  stabilizer  of  the  form 


G{r)  =  e-^ 

Gaussian,  p.d. 

G(r)  =  sjr2  +  c2 

multiquadric,  c.p.d.  1 

G{r)  ^c2+r2 

inverse  multiquadric,  p.d. 

G(r)  =  r2n+1 

multivariate  splines,  c.p.d.  n 

Jh 

g 

CM 

S- 

II 

Xs 

multivariate  splines,  c.p.d.  n 

3.2  Tensor  product  stabHizers 

An  alternative  to  choosing  a  radial  function  G  in  the 
stabilizer  (4)  is  a  tensor  product  type  of  basis  function, 
that  is  a  function  of  the  form 

G(s)  =  n^=i<?(«j)  (6) 

where  sj  is  the  j-th  coordinate  of  the  vector  s,  and  g 
is  an  appropriate  one-dimensional  function.  When  g  is 
positive  definite  the  functional  <j>[f  ]  is  clearly  a  norm  and 
its  null  space  is  empty.  In  the  case  of  a  conditionally 
positive  definite  function  the  structure  of  the  null  space 
can  be  more  complicated  and  we  do  not  consider  it  here. 
Stabilizers  with  G(s)  as  in  equation  (6)  have  the  form 

Jr*  nUMsj 


*[f]=  (  dse^  |/(s)|2, 

Jr* 
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where  f3  is  a  fixed  positive  parameter,  has  G(s)  =  e  f> 
and  as  basis  function  the  Gaussian  function,  represented 
in  figure  (2).  The  Gaussian  function  is  positive  definite, 
and  it  is  well  known  from  the  theory  of  reproducing  ker¬ 
nels  that  positive  definite  functions  can  be  used  to  de¬ 
fine  norms  of  the  type  (4).  Since  is  a  norm,  its  null 
space  contains  only  the  zero  element,  and  the  additional 
null  space  terms  of  eq.  (3)  are  not  needed,  unlike  in 
Duchon  splines.  A  disadvantage  of  the  Gaussian  is  the 
appearance  of  the  scaling  parameter  (3,  while  Duchon 
splines,  being  homogeneous  functions,  do  not  depend  on 
any  scaling  parameter.  However,  it  is  possible  to  devise 
good  heuristics  that  furnish  sub-optimal,  but  still  good, 
values  of  f3,  or  good  starting  points  for  cross-validation 
procedures. 

Other  Basis  Functions 

Here  we  give  a  list  of  other  functions  that  can  be  used  as 
basis  functions  in  the  Radial  Basis  Functions  technique, 
and  that  are  therefore  associated  with  the  minimization 
of  some  functional.  In  the  following  table  we  indicate  as 
“p.d.”  the  positive  definite  functions,  which  do  not  need 
any  polynomial  term  in  the  solution,  and  as  “c.p.d.  k” 
the  conditionally  positive  definite  functions  of  order  k, 
which  need  a  polynomial  of  degree  k  in  the  solution. 


which  leads  to  a  tensor  product  basis  function 


G(x)  =  H  f=lg(xi) 

where  Xj  is  the  y-th  coordinate  of  the  vector  x  and  g(x) 
is  the  Fourier  transform  of  g(s).  An  interesting  example 
is  the  one  corresponding  to  the  choice: 


9{‘) 


1 

IT?  ’ 


which  leads  to  the  basis  function: 


G(x)  =  nf=1e-^l  =  e  ^=1 |a!il  =  e-IWUa  . 

This  basis  function  is  interesting  from  the  point  of  view 
of  VLSI  implementations,  because  it  requires  the  com¬ 
putation  of  the  Li  norm  of  the  input  vector  x,  which 
is  usually  easier  to  compute  than  the  Euclidean  norm 
L2.  However,  this  basis  function  in  not  very  smooth, 
as  shown  in  figure  (3),  and  its  performance  in  practical 
cases  should  first  be  tested  experimentally. 

We  notice  that  the  choice 


<?(«)  =  e 


leads  again  to  the  Gaussian  basis  function  G(x)  = 


3.3  Additive  stabilizers 

We  have  seen  in  the  previous  section  how  some  tensor 
product  approximation  schemes  can  be  derived  in  the 
framework  of  regularization  theory.  We  now  will  see  that 
is  also  possible  to  derive  the  class  of  additive  approxima¬ 
tion  schemes  in  the  same  framework,  where  by  additive 
approximation  we  mean  an  approximation  of  the  form 

/(*)  =  E  M**)  (7) 

M=1 

where  x ^  is  the  yu-th  component  of  the  input  vector  x 
and  the  /M  are  one-dimensional  functions  that  will  be 
defined  as  the  additive  components  of  /  (from  now  on 
Greek  letter  indices  will  be  used  in  association  with  com¬ 
ponents  of  the  input  vectors).  Additive  models  are  well 
known  in  statistics  (see  Hastie  and  Tibshirani’s  book, 
1990)  and  can  be  consider  as  a  generalization  of  linear 
models.  They  are  appealing  because,  being  essentially  a 
superposition  of  one-dimensional  functions,  they  have  a 
low  complexity,  and  they  share  with  linear  models  the 
feature  that  the  effects  of  the  different  variables  can  be 
examined  separately. 

The  simplest  way  to  obtain  such  an  approximation 
scheme  is  to  choose  a  stabilizer  that  corresponds  to  an 
additive  basis  function  (see  fig.  4  for  an  example): 

n 

G(x)  =  EM*M)  (8) 

m=i 

where  8 M  are  certain  fixed  parameters.  Such  a  choice,  in 
fact,  leads  to  an  approximation  scheme  of  the  form  (7) 
in  which  the  additive  components  have  the  form: 


In  the  limit  of  e  going  to  zero  the  denominator  in  ex¬ 
pression  (11)  approaches  eq.  (10),  and  the  basis  func¬ 
tion  (12)  approaches  a  basis  function  that  is  the  sum  of 
one-dimensional  basis  functions.  In  this  paper  we  do  not 
discuss  this  limit  process  in  a  rigorous  way.  Instead  we 
outline  another  way  to  obtain  additive  approximations 
in  the  framework  of  regularization  theory. 

Let  us  assume  that  we  know  a  priori  that  the  function 
/  that  we  want  to  approximate  is  additive,  that  is: 

/(*)  =  E-M*") 

M  =  1 

We  then  apply  the  regularization  approach  and  impose  a 
smoothness  constraint,  not  on  the  function  /  as  a  whole, 
but  on  each  single  additive  component,  through  a  regu¬ 
larization  functional  of  the  form: 


N 


m  =  E  (*  - 

i=l 


E 
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where  8^  are  given  positive  parameters  which  allow  us  to 
impose  different  degrees  of  smoothness  on  the  different 
additive  components.  The  minimizer  of  this  functional 
is  found  with  the  same  technique  described  in  appendix 
(A),  and  skipping  null  space  terms,  it  has  the  usual  form 


where 


N 

/w  =  Eci(?(x  _  x>) 

i  =  1 


(13) 


N 

/m(*)=^Ec«G(*M-:c  0  (9) 

«=1 

Notice  that  the  additive  components  are  not  independent 
at  this  stage,  since  there  is  only  one  set  of  coefficients  Cj. 
We  postpone  the  discussion  of  this  point  to  section  (4.2). 

We  would  like  to  write  stabilizers  corresponding  to 
the  basis  function  (8)  in  the  form  (4),  where  G(s)  is  the 
Fourier  transform  of  G(x).  We  notice  that  the  Fourier 
transform  of  an  additive  function  like  the  one  in  equation 
(8)  is  a  distribution.  For  example,  in  two  dimensions  we 
obtain 


d 

G(x  -  x{)  =  E  ~  xi)  • 

/*=i 

as  in  eq.  (8). 

We  notice  that  the  additive  component  of  eq.  (13) 
can  be  written  as 

4.(*#,)  =  E^(*',-*n 

t=i 

where  we  have  defined 


G(s)  =  8xg(sx)6{sy)  +  8yg(sy)6(sx)  (10) 

and  the  interpretation  of  the  reciprocal  of  this  expression 
is  delicate.  However,  almost  additive  basis  functions  can 
be  obtained  if  we  approximate  the  delta  functions  in  eq. 
(10)  with  Gaussians  of  very  small  variance.  Consider, 
for  example  in  two  dimensions,  the  stabilizer: 


m  =  f 

JR‘ 


ds 


l/(s)!2 


lRd  8xg(sx)e  (f)5  +  8yg(sy)e 
This  corresponds  to  a  basis  function  of  the  form: 

G(x,y)  =  8xg{x)e~(:‘y'‘  +  . 


(11) 


(12) 


The  additive  components  are  therefore  not  independent 
because  the  parameters  8 M  are  fixed.  If  the  8 y  were  free 
parameters,  the  coefficients  cf  would  be  independent,  as 
well  as  the  additive  components. 

Notice  that  the  two  ways  we  have  outlined  for  deriv¬ 
ing  additive  approximation  from  regularization  theory 
are  equivalent.  They  both  start  from  a  prior  assumption 
of  additivity  and  smoothness  of  the  class  of  functions 
to  be  approximated.  In  the  first  technique  the  two  as¬ 
sumptions  are  both  in  the  choice  of  the  stabilizer,  (eq. 
11);  in  the  second  they  are  made  explicit  and  exploited 
sequentially. 


4  Extensions:  from  Regularization 
Networks  to  Generalized 
Regularization  Networks 

In  this  section  we  will  first  review  some  extensions  of  reg¬ 
ularization  networks,  and  then  will  apply  them  to  Radial 
Basis  Functions  and  to  additive  splines. 

A  fundamental  problem  in  almost  all  practical  appli¬ 
cations  in  learning  and  pattern  recognition  is  the  choice 
of  the  relevant  variables.  It  may  happen  that  some  of 
the  variables  are  more  relevant  than  others,  that  some 
variables  are  just  totally  irrelevant,  or  that  the  relevant 
variables  are  linear  combinations  of  the  original  ones. 
It  can  therefore  be  useful  to  work  not  with  the  original 
set  of  variables  x,  but  with  a  linear  transformation  of 
them,  Wx,  where  W  is  a  possibily  rectangular  matrix. 
In  the  framework  of  regularization  theory,  this  can  be 
taken  into  account  by  making  the  assumption  that  the 
approximating  function  /  has  the  form  /(x)  =  F(Wx) 
for  some  smooth  function  F.  The  smoothness  assump¬ 
tion  is  now  made  directly  on  F ,  through  a  smoothness 
functional  <j>[F]  of  the  form  (4).  The  regularization  func¬ 
tional  is  now  expressed  in  terms  of  F  as 

N 

H[F ]  =  -  F(Zi))2  + 

i  =  l 

where  z,-  =  Wx,.  The  function  that  minimizes  this  func¬ 
tional  is  clearly,  accordingly  to  the  results  of  section  (2), 
of  the  form: 

N 

F(z)  =  YCiG( z  -  z i)  . 

2  —  1 

(plus  eventually  a  polynomial  in  z).  Therefore  the  solu¬ 
tion  for  /  is: 

N 

/(x)  =  f(Wx)  =  Y,  CM Wx  -  Wxj)  (14) 
2—1 

This  argument  is  exact  for  given  and  known  W,  as  in 
the  case  of  classical  Radial  Basis  Functions.  Usually  the 
matrix  W  is  unknown,  and  it  must  be  estimated  from 
the  examples.  Estimating  both  the  coefficients  c*  and 
the  matrix  W  by  least  squares  is  probably  not  a  good 
idea,  since  we  would  end  up  trying  to  estimate  a  num¬ 
ber  of  parameters  that  is  larger  than  the  number  of  data 
points  (though  one  may  use  regularized  least  squares). 
Therefore,  it  has  been  proposed  to  replace  the  approxi¬ 
mation  scheme  of  eq.  (14)  with  a  similar  one,  in  which  he 
basic  shape  of  the  approximation  scheme  is  retained,  but 
the  number  of  basis  functions  is  decreased.  The  result¬ 
ing  approximating  function  that  we  call  the  Generalized 
Regularization  Network  (GRN)  is: 

n 

/(x)  =  Y  C°G( Wx  -  Wt„)  .  (15) 

a  =  l 

where  n  <  N  and  the  centers  ta  are  chosen  according  to 
some  heuristic  (Moody  and  Darken,  1989),  or  are  consid¬ 
ered  as  free  parameters  (Poggio  and  Girosi,  1989,  1990). 


The  coefficients  ca  and  the  elements  of  the  matrix  W 
are  estimated  accordingly  to  a  least  squares  criterion. 
The  elements  of  the  matrix  W  could  also  be  estimated 
through  cross-validation,  which  may  be  a  formally  more 
appropriate  technique. 

In  the  special  case  in  which  the  matrix  W  and  the 
centers  are  kept  fixed,  the  resulting  technique  is  one  orig¬ 
inally  proposed  by  Broomhead  and  Lowe  (1988),  and  the 
coefficients  satisfy  the  following  linear  equation: 

GtGc =  Gt y , 

where  we  have  defined  the  following  vectors  and  matri¬ 
ces: 

(y)i  =  Vi  i  (c)a  =  c«  i  ( G)ia  —  G(xj  ta)  . 

This  technique,  which  has  become  quite  common  in  the 
neural  network  community,  has  the  advantage  of  retain¬ 
ing  the  form  of  the  regularization  solution,  while  being 
less  complex  to  compute.  A  complete  theoretical  analy¬ 
sis  has  not  yet  been  given,  but  some  results,  in  the  case 
in  which  the  matrix  W  is  set  to  identity,  are  already 
available  (Sivakumar  and  Ward,  1991). 

The  next  sections  discuss  approximation  schemes  of 
the  form  (15)  in  the  cases  of  radial  and  additive  basis 
functions. 

4.1  Extensions  of  Radial  Basis  Functions 

In  the  case  in  which  the  basis  function  is  radial,  the 
approximation  scheme  of  eq.  (15)  becomes: 

/M  =  X)c„G(||x-ta||w) 

a=l 

where  we  have  defined  the  weighted  norm: 

||x||w  =  x  WrWx  .  (16) 

The  basis  functions  of  eq.  (15)  are  not  radial  anymore, 
or,  more  accurately,  they  are  radial  in  the  metric  defined 
by  eq.  (16).  This  means  that  the  level  curves  of  the  basis 
functions  are  not  circles,  but  ellipses,  whose  axes  do  not 
need  to  be  aligned  with  the  coordinate  axis.  Notice  that 
in  this  case  what  is  important  is  not  the  matrix  W  itself, 
but  rather  the  product  matrix  WTW.  Therefore,  by  the 
Cholesky  decomposition,  it  is  sufficient  to  take  W  upper 
triangular.  The  approximation  scheme  defined  by  eq. 
(15)  has  been  discussed  in  detail  in  (Poggio  and  Girosi, 
1990;  Girosi,  1992),  so  we  do  will  not  discuss  it  further, 
and  will  consider,  in  the  next  section,  its  analogue  in  the 
case  of  additive  basis  functions. 

4.2  Extensions  of  additive  splines 

In  the  previous  sections  we  have  seen  an  extension  of 
the  classical  regularization  technique.  In  this  section  we 
derive  the  form  that  this  extension  takes  when  applied 
to  additive  splines.  The  resulting  scheme  is  very  similar 
to  Projection  Pursuit  Regression  (Friedman  and  Stuezle, 
1981;  Huber,  1985). 

We  start  from  the  “classical”  additive  spline,  derived 
from  regularization  in  section  (3.3): 


(17) 

i=l  (i  =  l 

In  this  scheme  the  smoothing  parameters  8^  should  be 
known,  or  can  be  estimated  by  cross-validation.  An  al¬ 
ternative  to  cross-validation  is  to  consider  the  param¬ 
eters  8  ^  as  free  parameters ,  and  estimate  them  with  a 
least  square  technique  together  with  the  coefficients  c;. 
If  the  parameters  8  ^  are  free,  the  approximation  scheme 
of  eq.  (17)  becomes  the  following: 

/(x)  =  X  X  ~  xi) 

i= 1  n—1 

where  the  coefficients  cf  are  now  independent.  Of 
course,  now  we  must  estimate  N  x  d  coefficients  instead 
of  just  N ,  and  we  are  likely  to  encounter  the  overfitting 
problem.  We  then  adopt  the  same  idea  presented  in  sec¬ 
tion  (4),  and  consider  an  approximation  scheme  of  the 
form 

n  d 

=  (18) 
a=l  (i=l 

in  which  the  number  of  centers  is  smaller  than  the  num¬ 
ber  of  examples,  reducing  the  number  of  coefficients  that 
must  be  estimated.  We  notice  that  eq.  (18)  can  be  writ¬ 
ten  as 

/(x)  =  X  /#•(*") 

**=1 

where  each  additive  component  has  the  form: 

n 

/m(*m)  =  Xc£g(*',-*£)- 

*  a  — 1 

Therefore  another  advantage  of  this  technique  is  that 
the  additive  components  are  now  independent ,  each  of 
them  being  a  one-dimensional  Radial  Basis  Functions. 

We  can  now  use  the  same  argument  from  section  (4)  to 
introduce  a  linear  transformation  of  the  inputs  x  — >  Wx, 
where  W  is  a  d'  x  d  matrix.  Calling  the  /i-th  column 
of  W,  and  performing  the  substitution  x  — *  Wx  in  eq. 
(18),  we  obtain 

n  d' 

/(x)  =  EEc'gK  •*-*£)  •  (19) 

a  =  l  fi  =  l 

We  now  define  the  following  one-dimensional  function: 
hfi(y)  =  X  coG(y  -  ta) 

a  =  1 

and  rewrite  the  approximation  scheme  of  eq:  (19)  as 
d' 

/(x)  =  XMW^  'x)  •  (20) 

i-1 


Notice  the  similarity  between  eq.  (20)  and  the  Projec¬ 
tion  Pursuit  Regression  technique:  in  both  schemes  the 
unknown  function  is  approximated  by  a  linear  superposi¬ 
tion  of  one-dimensional  variables,  which  are  projections 
of  the  original  variables  on  certain  vectors  that  have  been 
estimated.  In  Projection  Pursuit  Regression  the  choice 
of  the  functions  hk{y)  is  left  to  the  user.  In  our  case  the 
hk  are  one-dimensional  Radial  Basis  Functions,  for  ex¬ 
ample  cubic  splines,  or  Gaussians.  The  choice  depends, 
strictly  speaking,  on  the  specific  prior,  that  is,  on  the 
specific  smoothness  assumptions  made  by  the  user.  In¬ 
terestingly,  in  many  applications  of  Projection  Pursuit 
Regression  the  functions  hj.  have  been  indeed  chosen  to 
be  cubic  splines. 

Let  us  briefly  review  the  steps  that  bring  us  from  the 
classical  additive  approximation  scheme  of  eq.  (9)  to 
a  Projection  Pursuit  Regression-like  type  of  approxima¬ 
tion: 

1.  the  regularization  parameters  8 M  of  the  classical  ap¬ 
proximation  scheme  (9)  are  considered  as  free  pa¬ 
rameters; 

2.  the  number  of  centers  is  chosen  to  be  smaller  than 
the  number  of  data  points; 

3.  it  is  assumed  that  the  true  relevant  variables  are 
some  unknown  linear  combination  of  the  original 
variables; 

We  notice  that  in  the  special  case  in  which  each  addi¬ 
tive  component  has  just  one  center  (n  =  1),  the  approx¬ 
imation  scheme  of  eq.  (19)  becomes: 

d' 

/(x)=X^(  wM.x  — <").  (21) 

n= i 

If  the  basis  function  G  were  a  sigmoidal  function  this 
would  be  clearly  a  standard  Multilayer  Perceptron  with 
one  layer  of  hidden  units.  Sigmoidal  functions  cannot 
be  derived  from  regularization  theory,  but  we  will  see  in 
section  (6)  the  relationship  between  a  sigmoidal  function 
and  a  basis  function  that  can  be  derived  from  regular¬ 
ization,  like  the  absolute  value. 

There  are  clearly  a  number  of  computational  issues  re¬ 
lated  to  how  to  find  the  parameters  of  an  approximation 
scheme  like  the  one  of  eq.  (19),  but  we  do  not  discuss 
them  here.  We  present  instead,  in  section  (7),  some  ex¬ 
perimental  results,  and  will  describe  the  algorithm  used 
to  obtain  them. 

5  Priors,  stabilizers  and  basis  functions 

It  is  well  known  that  a  variational  principle  such  as  equa¬ 
tion  (1)  can  be  derived  not  only  in  the  context  of  func¬ 
tional  analysis  (Tikhonov  and  Arsenin,  1977),  but  also  in 
a  probabilistic  framework  (Marroquin  et  al.,  1987;  Bert- 
ero  et  al.,  1988,  Wahba,  1990).  In  this  section  we  illus¬ 
trate  this  connection  informally,  without  addressing  the 
several  deep  mathematical  issues  of  the  problem. 

Suppose  that  the  set  g  =  {(x,-,i/j)  G  iT*  x  of 

data  has  been  obtained  by  random  sampling  a  function 
/,  defined  on  Rn ,  in  the  presence  of  noise,  that  is 
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/(xi)  =  yi+ei,  i  =  l,...,N  (22) 

where  et-  are  random  independent  variables  with  a  given 
distribution.  We  are  interested  in  recovering  the  func¬ 
tion  /,  or  an  estimate  of  it,  from  the  set  of  data  g.  We 
take  a  probabilistic  approach,  and  regard  the  function  f 
as  the  realization  of  a  random  field  with  a  known  prior 
probability  distribution.  Let  us  define: 


-  V[f\g]  as  the  conditional  probability  of  the  function 
f  given  the  examples  g. 


-  V[g\f]  as  the  conditional  probability  of  g  given  /.  If 
the  function  underlying  the  data  is  /,  this  is  the  prob¬ 
ability  that  by  random  sampling  the  function  /  at  the 
sites  {xj}^  the  set  of  measurement  {yi}£Li  *s  obtained, 
being  therefore  a  model  of  the  noise. 

-  V[f\.  is  the  a  priori  probability  of  the  random  field  /. 
This  embodies  our  a  priori  knowledge  of  the  function, 
and  can  be  used  to  impose  constraints  on  the  model, 
assigning  significant  probability  only  to  those  functions 
that  satisfy  those  constraints. 


Assuming  that  the  probability  distributions  V[g\f] 
and  V[f]  are  know,  the  posterior  distribution  V[f\g]  can 
now  be  computed  by  applying  the  Bayes  rule: 


V{f\g)«V[g\f]V[f}.  (23) 

We  now  make  the  assumption  that  the  noise  variables 
in  eq.  (22)  are  normally  distributed,  with  variance  cr. 
Therefore  the  probability  V[g\f]  can  be  written  as: 


V[g\f]  oc 


where  cr  is  the  variance  of  the  noise. 

The  model  for  the  prior  probability  distribution  V[f] 
is  chosen  in  analogy  with  the  discrete  case  (when  the 
function  /  is  defined  on  a  finite  subset  of  a  n-dimensional 
lattice)  for  which  the  problem  can  be  rigorously  formal¬ 
ized  (Marroquin  et  al.,  1987).  The  prior  probability  V[f] 
is  written  as 


V[f]  oc 

where  <f>[f]  is  a  smoothness  functional  of  the  type  de¬ 
scribed  in  section  (3)  and  a  a  positive  real  number.  This 
form  of  probability  distribution  gives  high  probability 
only  to  those  functions  for  which  the  term  <j>[f]  is  small, 
and  embodies  the  a  priori  knowledge  that  one  has  about 
the  system. 

Following  the  Bayes  rule  (23)  the  a  posteriori  proba¬ 
bility  of  /  is  written  as 

V[f\g]  oc  (24) 

One  simple  estimate  of  the  function  /  from  the  prob¬ 
ability  distribution  (24)  is  the  so  called  MAP  ( Maximum 
A  Posteriori )  estimate,  that  considers  the  function  that 
maximizes  the  a  posteriori  probability  V[f\g],  or  mini¬ 
mizes  the  exponent  in  equation  (24).  The  MAP  estimate 
of  /  is  therefore  the  minimizer  of  the  following  functional: 


H[f]  =  ^(j k  -  /(xi))2  +  A 4>[f)  . 

i=  1 

where  A  =  2<r2o:.  This  functional  is  the  same  as  that 
of  eq.  (1),  and  from  here  it  is  clear  that  the  parameter 
A,  that  is  usually  called  the  “regularization  parameter” 
determines  the  trade-off  between  the  level  of  the  noise 
and  the  strength  of  the  a  priori  assumptions  about  the 
solution,  therefore  controlling  the  compromise  between 
the  degree  of  smoothness  of  the  solution  and  its  closeness 
to  the  data. 

As  we  have  pointed  out  (Poggio  and  Girosi,  1989), 
prior  probabilities  can  also  be  seen  as  a  measure  of  com¬ 
plexity,  assigning  high  complexity  to  the  functions  with 
small  probability.  It  has  been  proposed  by  Rissanen 
(1978)  to  measure  the  complexity  of  a  hypothesis  in 
terms  of  the  bit  length  needed  to  encode  it.  It  turns 
out  that  the  MAP  estimate  mentioned  above  is  closely 
related  to  the  Minimum  Description  Length  Principle: 
the  hypothesis  /  which  for  given  g  can  be  described  in 
the  most  compact  way  is  chosen  as  the  “best”  hypothe¬ 
sis.  Similar  ideas  have  been  explored  by  others  (for  in¬ 
stance  Solomonoff  in  1978).  They  connect  data  compres¬ 
sion  and  coding  with  Bayesian  inference,  regularization, 
function  approximation  and  learning. 


5.1  The  Bayesian  interpretation  of  Generalized 
Regularization  Networks 

In  the  probabilistic  interpretation  of  standard  regulariza¬ 
tion  the  term  A cj>[f]  in  the  regularization  functional  cor¬ 
responds  to  the  following  prior  probability  in  a  Bayesian 
formulation  in  which  the  MAP  estimate  is  sought: 

V[f]  oc  e"AW]. 

From  this  point  of  view,  the  extension  of  section  (4)  cor¬ 
responds  (again  informally)  to  choose  an  a  priori  prob¬ 
ability  of  the  form 

V[f\ oc  J  6ge-^6(f(X)  -  g( Wx)) 

where  6g  means  that  a  functional  integration  is  being 
performed.  This  restricts  the  space  of  functions  on  which 
the  probability  distribution  is  defined  to  the  class  of  func¬ 
tions  that  can  be  written  as  /(x)  =  g( Wx),  and  assume 
a  prior  probability  distribution  e_;'^9,((x))]  for  the  func¬ 
tions  g,  where  <f>  is  now  a  radially  symmetric  stabilizer. 

In  a  similar  manner,  in  the  case  of  additive  approxi¬ 
mation  the  prior  probability  of  /  is  concentrated  on  those 
functions  /  that  can  be  written  as  sums  of  additive  com¬ 
ponents,  and  corresponding  priors  are  of  the  form: 

7>[/]oc  J  ^/(x)  -  E /**(*'*) 

This  is  equivalent  to  saying  that  we  know  a  priori  that 
the  underlying  function  is  additive. 


6  Additive  splines,  hinge  functions, 
sigmoidal  neural  nets 

In  the  previous  sections  we  have  shown  how  to  extend 
RN  to  schemes  that  we  have  called  GRN,  which  include 
ridge  approximation  schemes  of  the  PPR  type,  that  is 

d' 

/(x)  =  ’x)  - 
i= 1 

where 

n 

h»(y)  =  Y,c°G(y-t«)- 

a  =  1 

The  form  of  the  basis  function  G  depends  on  the  sta¬ 
bilizer,  and  a  list  of  “admissible”  G  has  been  given  in 
section  (3).  These  include  the  absolute  value  G(x)  =  |z| 
-  corresponding  to  piecewise  linear  splines,  and  the  func¬ 
tion  G(«)  =  |a;|3  -  corresponding  to  cubic  splines  (used 
in  typical  implementations  of  PPR),  as  well  as  Gaussian 
functions.  Though  it  may  seem  natural  to  think  that  sig¬ 
moidal  multilayer  perceptrons  may  be  included  in  this 
framework,  it  is  actually  impossible  to  derive  directly 
from  regularization  principles  the  sigmoidal  activation 
functions  typically  used  in  Multilayer  Perceptrons.  In 
the  following  section  we  show,  however,  that  there  is  a 
close  relationship  between  basis  functions  of  the  hinge, 
the  sigmoid  and  the  Gaussian  type. 

6.1  From  additive  splines  to  ramp  and  hinge 
functions 

We  will  consider  here  the  one-dimensional  case.  Mul¬ 
tidimensional  additive  approximations  consist  of  one¬ 
dimensional  terms  (once  the  W  has  been  fixed).  We 
consider  the  approximation  with  the  lowest  possible  de¬ 
gree  of  smoothness:  piecewise  linear.  The  associated 
basis  function  (?(*)  =  |as|  is  shown  in  figure  5  top  left, 
and  the  associated  stabilizer  is  given  by 


Its  use  in  approximating  a  one-dimensional  function  con¬ 
sists  of  the  linear  combination  with  appropriate  coeffi¬ 
cients  of  translates  of  |aj|.  It  is  easy  to  see  that  a  lin¬ 
ear  combination  of  two  translates  of  |z|  with  appropri¬ 
ate  coefficients  (positive  and  negative  and  equal  in  ab¬ 
solute  value)  yields  the  piecewise  linear  threshold  func¬ 
tion  shown  in  figure  5.  Linear  combinations  of 

translates  of  such  functions  can  be  used  to  approximate 
one-dimensional  functions.  A  similar  derivative-like,  lin¬ 
ear  combination  of  two  translates  of  <tl(x)  functions  with 
appropriate  coefficients  yields  the  Gaussian-like  function 
9l{%)  also  shown  in  figure  5.  Linear  combinations  of 
translates  of  this  function  can  also  be  used  for  approxi¬ 
mation  of  a  function.  Thus  any  given  approximation  in 
terms  of  3i(*)  can  be  rewritten  in  terms  of  and 

the  latter  can  be  in  turn  expressed  in  terms  of  the  basis 
function  |*|. 

Notice  that  the  basis  functions  |z|  underlie  the 
“hinge”  technique  proposed  by  Breiman  (1992),  whereas 


the  basis  functions  <yL( x)  are  sigmoidal-like  and  the 
9l{^)  are  Gaussian-like.  The  arguments  above  show  the 
close  relations  between  all  of  them,  despite  the  fact  that 
only  |  a  |  is  strictly  a  “legal”  basis  function  from  the  point 
of  view  of  regularization  (ffi(a)  is  not,  though  the  very 
similar  but  smoother  Gaussian  is).  Notice  also  that  |*| 
can  be  expressed  in  terms  of  “ramp”  functions,  that  is 
|*|  =  x+  + 

These  relationships  imply  that  it  may  be  interesting 
to  compare  how  well  each  of  these  basis  functions  is  able 
to  approximate  some  simple  function.  To  do  this  we  used 
the  model  /(*)  =  ^”caG( x  —  ta)  to  approximate  the 
function  h(x)  =  sin(27ra)  on  [0,1],  where  G(a)  is  one  of 
the  basis  functions  of  figure  5.  The  function  sin(27ra:) 
is  plotted  in  figure  6.  Fifty  training  points  and  10,000 
test  points  were  chosen  uniformly  on  [0, 1],  The  param¬ 
eters  were  learned  using  the  iterative  backfitting  algo¬ 
rithm  that  will  be  described  in  section  7.  We  looked  at 
the  function  learned  after  fitting  1,  2,  4,  8  and  16  basis 
functions.  The  resulting  approximations  are  plotted  in 
the  following  figures  and  the  errors  are  summarized  in 
table  1. 

The  results  show  that  the  performance  of  all  three 
basis  functions  is  fairly  close  as  the  number  of  basis 
functions  increases.  All  models  did  a  good  job  of  ap¬ 
proximating  sin(27ra:).  The  absolute  value  function  did 
slightly  better  and  the  “Gaussian”  function  did  slightly 
worse.  It  is  interesting  that  the  approximation  using  two 
absolute  value  functions  is  almost  identical  to  the  ap¬ 
proximation  using  one  “sigmoidal”  function  which  again 
shows  that  two  absolute  value  basis  functions  can  sum 
to  equal  one  “sigmoidal”  piecewise  linear  function. 

7  Numerical  illustrations 

7.1  Comparing  additive  and  non-additive 
models 

In  order  to  illustrate  some  of  the  ideas  presented  in  this 
paper  and  to  provide  some  practical  intuition  about  the 
various  models,  we  present  numerical  experiments  com¬ 
paring  the  performance  of  additive  and  non-additive  net¬ 
works  on  two-dimensional  problems.  In  a  model  consist¬ 
ing  of  a  sum  of  two-dimensional  Gaussians,  the  model 
can  be  changed  from  a  non-additive  Radial  Basis  Func¬ 
tion  network  to  an  additive  network  by  “elongating”  the 
Gaussians  along  the  two  coordinate  axes.  This  allows  us 
to  measure  the  performance  of  a  network  as  it  changes 
from  a  non-additive  scheme  to  an  additive  one. 

Five  different  models  were  tested.  The  first  three  dif¬ 
fer  only  in  the  variances  of  the  Gaussian  along  the  two 
coordinate  axes.  The  ratio  of  the  x  variance  to  the  y  vari¬ 
ance  determines  the  elongation  of  the  Gaussian.  These 
models  all  have  the  same  form  and  can  be  written  as: 

N 

/(x)  =  X!ci[Gi(x~xi)  +  G2(x~x<)] 

1  =  1 

where 


and 


The  models  differ  only  in  the  values  of  cr\  and  <72-  For 
the  first  model,  <7X  =  .5  and  <72  =  .5  (RBF),  for  the 
second  model  ax  =  10  and  <r2  =  .5  (elliptical  Gaussian), 
and  for  the  third  model,  oq  =  00  and  <r2  =  .5  (additive). 
These  models  correspond  to  placing  two  Gaussians  at 
each  data  point  x*,  with  one  Gaussian  elongated  in  the 
x  direction  and  one  elongated  in  the  y  direction.  In  the 
first  case  (RBF)  there  is  no  elongation,  in  the  second  case 
(elliptical  Gaussian)  there  is  moderate  elongation,  and 
in  the  last  case  (additive)  there  is  infinite  elongation.  In 
these  three  models,  the  centers  were  fixed  in  the  learning 
algorithm  and  equal  to  the  training  examples.  The  only 
parameters  that  were  learned  were  the  coefficients 

The  fourth  model  is  an  additive  model  of  the  form 
(18),  in  which  the  number  of  centers  is  smaller  than  the 
number  of  data  points,  but  the  additive  components  are 
independent,  and  can  be  written  as: 

n  n 

f(x,  y)  =  '£  baG(x  -  £)  +  £  c,G(y  -  t%) 

a=l  /3  =  1 

where  the  basis  function  is  the  Gaussian: 

G(x)=e~2x\ 

In  this  model,  the  centers  were  also  fixed  in  the  learn¬ 
ing  algorithm,  and  were  a  proper  subset  of  the  training 
examples,  so  that  there  were  fewer  centers  than  exam¬ 
ples.  In  the  experiments  that  follow,  7  centers  were  used 
with  this  model,  and  the  coefficients  ba  and  ca  were  de¬ 
termined  by  least  squares. 

The  fifth  model  is  a  Generalized  Regularization  Net¬ 
work  model,  of  the  form  (21),  that  uses  a  Gaussian  basis 
function: 

/(x)  =  ^cae-(w»x-f«)3. 

a  =  l 

In  this  model  the  weight  vectors,  centers,  and  coeffi¬ 
cients  are  all  learned. 

The  coefficients  of  the  first  four  models  were  set  by 
solving  the  linear  system  of  equations  by  using  the 
pseudo-inverse,  which  finds  the  best  mean  squared  fit 
of  the  linear  model  to  the  data. 

The  fifth  model  was  trained  by  fitting  one  basis  func¬ 
tion  at  a  time  according  to  the  following  algorithm: 

•  Add  a  new  basis  function; 

•  Optimize  the  parameters  wa,  ta  and  c„  using  the 
random  step  algorithm  (described  below); 

•  Backfitting:  for  each  basis  function  a  added  so  far: 

—  hold  the  parameters  of  all  other  functions 
fixed; 

—  reoptimize  the  parameters  of  function  a; 

•  Repeat  the  backfitting  stage  until  there  is  no  sig¬ 
nificant  decrease  in  T2  error. 


The  random  step  algorithm  (Caprile  and  Girosi,  1990) 
for  optimizing  a  set  of  parameters  works  as  follows.  Pick 
random  changes  to  each  parameter  such  that  each  ran¬ 
dom  change  lies  within  some  interval  [a,  6].  Add  the 
random  changes  to  each  parameter  and  then  calculate 
the  new  error  between  the  output  of  the  network  and 
the  target  values.  If  the  error  decreases,  then  keep  the 
changes  and  double  the  length  of  the  interval  for  pick¬ 
ing  random  changes.  If  the  error  increases,  then  throw 
out  the  changes  and  halve  the  size  of  the  interval.  If  the 
length  of  the  interval  becomes  less  than  some  threshold, 
then  reset  the  length  of  the  interval  to  some  larger  value. 

The  five  models  were  each  tested  on  two  different  func¬ 
tions:  a  two-dimensional  additive  function: 

h(x,  y )  =  sin(2-7ra)  +  4 (y  —  0.5)2 
and  the  two-dimensional  Gabor  function: 

g(x,  y)  =  e~^xN  cos(.75x(a;  +  y)). 

The  graphs  of  these  functions  are  shown  in  figure 
10.  The  training  data  for  the  additive  function  con¬ 
sisted  of  20  points  picked  from  a  uniform  distribution 
on  [0,1]  x  [0,1].  Another  10,000  points  were  randomly 
chosen  to  serve  as  test  data.  The  training  data  for  the 
Gabor  function  consisted  of  20  points  picked  from  a  uni¬ 
form  distribution  on  [—1,1]  x  [—1,1]  with  an  additional 
10,000  points  used  as  test  data. 

In  order  to  see  how  sensitive  were  the  performances 
to  the  choice  of  basis  function,  we  also  repeated  the  ex¬ 
periments  for  the  models  3,  4  and  5  with  a  sigmoid  (that 
is  not  a  basis  function  that  can  be  derived  from  regular¬ 
ization  theory)  replacing  the  Gaussian  basis  function.  In 
our  experiments  we  used  the  standard  sigmoid  function: 


These  models  (6,  7  and  8)  are  shown  in  table  2  together 
with  models  1  to  5.  Notice  that  only  model  8  is  a  Multi¬ 
layer  Perceptron  in  the  standard  sense.  The  results  are 
summarized  in  table  3. 

Plots  of  some  of  the  approximations  are  shown  in  fig¬ 
ures  11,  12,  13  and  14.  As  expected,  the  results  show 
that  the  additive  model  was  able  to  approximate  the  ad¬ 
ditive  function,  h(x,y)  better  than  both  the  RBF  model 
and  the  elliptical  Gaussian  model.  Also,  there  seems  to 
be  a  smooth  degradation  of  performance  as  the  model 
changes  from  the  additive  to  the  Radial  Basis  Function 
(figure  11).  Just  the  opposite  results  are  seen  in  approx¬ 
imating  the  non-additive  Gabor  function,  g(x,y).  The 
RBF  model  did  very  well,  while  the  additive  model  did 
a  very  poor  job  in  approximating  the  Gabor  function 
(figures  12  and  13a).  However,  we  see  that  the  GRN 
scheme  (model  5),  gives  a  fairly  good  approximation  (fig¬ 
ure  13b).  This  is  due  to  the  fact  that  the  learning  al¬ 
gorithm  was  able  to  find  better  directions  to  project  the 
data  than  the  x  and  y  axes  as  in  the  pure  additive  model. 
We  can  also  see  from  table  3  that  the  additive  model 
with  fewer  centers  than  examples  (model  4)  has  a  larger 
training  error  than  the  purely  additive  model  3,  but  a 
much  smaller  test  error.  The  results  for  the  sigmoidal 
additive  model  learning  the  additive  function  h  (figure 


14)  show  that  it  is  comparable  to  the  Gaussian  additive 
model.  The  first  three  models  we  considered  had  a  num¬ 
ber  of  parameters  equal  to  the  number  of  data  points, 
and  were  supposed  to  exactly  interpolate  the  data,  so 
that  one  may  wonder  why  the  training  errors  are  not 
exactly  zero.  This  is  due  to  the  ill-conditioning  of  the 
associated  linear  system,  which  is  a  common  problem  in 
Radial  Basis  Functions  (Dyn,  Levin  and  Rippa,  1986). 

8  Summary  and  remarks 

A  large  number  of  approximation  techniques  can  be  writ¬ 
ten  as  multilayer  networks  with  one  hidden  layer,  as 
shown  in  figure  (16).  In  past  papers  (Poggio  and  Girosi, 
1989;  Poggio  and  Girosi,  1990,  1990b;  Maruyama,  Girosi 
and  Poggio,  1992)  we  showed  how  to  derive  RBF,  HBF 
and  several  types  of  multidimensional  splines  from  reg¬ 
ularization  principles  of  the  form  used  to  deal  with  the 
ill-posed  problem  of  function  approximation.  We  had 
not  used  regularization  to  yield  approximation  schemes 
of  the  additive  type  (Wahba,  1990;  Hastie  and  Tibshi- 
rani,  1990),  such  as  additive  splines,  ridge  approxima¬ 
tion  of  the  PPR  type  and  hinge  functions.  In  this  paper, 
we  show  that  appropriate  stabilizers  can  be  defined  to 
justify  such  additive  schemes,  and  that  the  same  exten¬ 
sions  that  leads  from  RBF  to  HBF  leads  from  additive 
splines  to  ridge  function  approximation  schemes  of  the 
Projection  Pursuit  Regression  type.  Our  Generalized 
Regularization  Networks  include,  depending  on  the  sta¬ 
bilizer  (that  is  on  the  prior  knowledge  on  the  functions 
we  want  to  approximate),  HBF  networks,  ridge  approxi¬ 
mation  and  tensor  products  splines.  Figure  (15)  shows  a 
diagram  of  the  relationships.  Notice  that  HBF  networks 
and  Ridge  Regression  networks  are  directly  related  in 
the  special  case  of  normalized  inputs  (Maruyama,  Girosi 
and  Poggio,  1992).  Also  note  that  Gaussian  HBF  net¬ 
works,  as  described  by  Poggio  and  Girosi  (1990)  contain 
in  the  limit  the  additive  models  we  describe  here. 

We  feel  that  there  is  now  a  theoretical  framework  that 
justifies  a  large  spectrum  of  approximation  schemes  in 
terms  of  different  smoothness  constraints  imposed  within 
the  same  regularization  functional  to  solve  the  ill-posed 
problem  of  function  approximation  from  sparse  data. 
The  claim  is  that  all  the  different  networks  and  cor¬ 
responding  approximation  schemes  can  be  justified  in 
terms  of  the  variational  principle 

N 

m  =  £(/(*«•)  -  yi)2  +  a  m .  (25) 

i- 1 

They  differ  because  of  different  choices  of  stabilizers 
<t>,  which  correspond  to  different  assumptions  of  smooth¬ 
ness.  In  this  context,  we  believe  that  the  Bayesian  inter¬ 
pretation  is  one  of  the  main  advantages  of  regularization: 
it  makes  clear  that  different  network  architectures  cor¬ 
respond  to  different  prior  assumptions  of  smoothness  of 
the  functions  to  be  approximated. 

The  common  framework  we  have  derived  suggests 
that  differences  between  the  various  network  architec¬ 
tures  are  relatively  minor,  corresponding  to  different 
smoothness  assumptions.  One  would  expect  that  each 


architecture  will  work  best  for  the  class  of  function  de¬ 
fined  by  the  associated  prior  (that  is  stabilizer),  an  ex¬ 
pectation  which  is  consistent  with  numerical  results  (see 
our  numerical  experiments  in  this  paper,  and  Maruyama 
et  al.  1992;  see  also  Donohue  and  Johnstone,  1989). 

Of  the  several  points  suggested  by  our  results  we  will 
discuss  one  here:  it  regards  the  surprising  relative  suc¬ 
cess  of  additive  schemes  of  the  ridge  approximation  type 
in  real  world  applications. 

As  we  have  seen,  ridge  approximation  schemes  depend 
on  priors  that  combine  additivity  of  one-dimensional 
functions  with  the  usual  assumption  of  smoothness.  Do 
such  priors  capture  some  fundamental  property  of  the 
physical  world?  Consider  for  example  the  problem  of 
object  recognition,  or  the  problem  of  motor  control.  We 
can  recognize  almost  any  object  from  any  of  many  small 
subsets  of  its  features,  visual  and  non-visual.  We  can 
perform  many  motor  actions  in  several  different  ways.  In 
most  situations,  our  sensory  and  motor  worlds  are  redun¬ 
dant.  In  terms  of  GRN  this  means  that  instead  of  high¬ 
dimensional  centers,  any  of  several  lower-dimensional 
centers  are  often  sufficient  to  perform  a  given  task.  This 
means  that  the  “and”  of  a  high-dimensional  conjunction 
can  be  replaced  by  the  “or”  of  its  components  -  a  face 
may  be  recognized  by  its  eyebrows  alone,  or  a  mug  by 
its  color.  To  recognize  an  object,  we  may  use  not  only 
templates  comprising  all  its  features,  but  also  subtem¬ 
plates,  comprising  subsets  of  features.  Additive,  small 
centers  -  in  the  limit  with  dimensionality  one  -  with  the 
appropriate  W  are  of  course  associated  with  stabilizers 
of  the  additive  type. 

Splitting  the  recognizable  world  into  its  additive  parts 
may  well  be  preferable  to  reconstructing  it  in  its  full  mul¬ 
tidimensionality,  because  a  system  composed  of  several 
independently  accessible  parts  is  inherently  more  robust 
than  a  whole  simultaneously  dependent  on  each  of  its 
parts.  The  small  loss  in  uniqueness  of  recognition  is  eas¬ 
ily  offset  by  the  gain  against  noise  and  occlusion.  There 
is  also  a  possible  meta-argument  that  we  report  here  only 
for  the  sake  of  curiosity.  It  may  be  argued  that  humans 
possibly  would  not  be  able  to  understand  the  world  if 
it  were  not  additive  because  of  the  too-large  number  of 
necessary  examples  (because  of  high  dimensionality  of 
any  sensory  input  such  as  an  image).  Thus  one  may  be 
tempted  to  conjecture  that  our  sensory  world  is  biased 
towards  an  “additive  structure”. 

A  Derivation  of  the  general  form  of 
solution  of  the  regularization 
problem 

We  have  seen  in  section  (2)  that  the  regularized  solu¬ 
tion  of  the  approximation  problem  is  the  function  that 
minimizes  a  cost  functional  of  the  following  form: 

N 

Hif]  =  ~  /(x*))2  +  A^[/]  •  (26) 

i=i 

jq  where  the  smoothness  functional  <£[/]  is  given  by 


For  the  smoothness  functional  we  have: 


The  first  term  measures  the  distance  between  the  data 
and  the  desired  solution  /,  and  the  second  term  measures 
the  cost  associated  with  the  deviation  from  smoothness. 
For  a  wide  class  of  functionals  <j>  the  solutions  of  the 
minimization  problem  (26)  all  have  the  same  form.  A 
detailed  and  rigorous  derivation  of  the  solution  of  the 
variational  principle  associated  with  eq.  (26)  is  outside 
the  scope  of  this  paper.  We  present  here  a  simple  deriva¬ 
tion  and  refer  the  reader  to  the  current  literature  for  the 
mathematical  details  (Wahba,  1990;  Madych  and  Nel¬ 
son,  1990;  Dyn,  1987). 

We  first  notice  that,  depending  on  the  choice  of  G, 
the  functional  <j)[f]  can  have  a  non-empty  null  space, 
and  therefore  there  is  a  certain  class  of  functions  that 
are  “invisible”  to  it.  To  cope  with  this  problem  we  first 
define  an  equivalence  relation  among  all  the  functions 
that  differ  for  an  element  of  the  null  space  of  4>[f\.  Then 
we  express  the  first  term  of  H[f]  in  terms  of  the  Fourier 
transform  of  /: 

/(*)  =  C  f  ds  /(s)e*xs 

J  Rd 

obtaining  the  functional 


j-f  dS 

<5/(t)  Jr d  G(s) 


2  /  d,  /(-)</(*) 
Jr*  G( s)  6f{ t) 


i 


ds  6(s  —  t) 


2/M) 


iRd  G(  s)  '  7  G{  t) 

Using  these  results  we  can  now  write  eq.  (27)  as: 


i>  -  /(*»«*•* + = o . 

Changing  t  in  — t  and  multiplying  by  G(t)  on  both  sides 
of  this  equation  we  get: 


N 


/(t)  =  G(-t)£ 


i- 1 


im  -  /(*0)  jx,-t 

A 


We  now  define  the  coefficients 


«  =  i* -!<■«»  i=i . », 


assume  that  G  is  symmetric  (so  that  its  Fourier  trans¬ 
form  is  real),  and  take  the  Fourier  transform  of  the  last 
equation,  obtaining: 


l/»|2 


=  [  ds  /(s)e*Xi  S)2+A  f  ds  ^ 

Jr d  J  Rd  G(  s) 

Then  we  notice  that  since  /  is  real,  its  Fourier  transform 
satisfies  the  constraint: 


/*(■)  =  /(“•) 

so  that  the  functional  can  be  rewritten  as: 


N  N 

/(x)  =  -  X)  *  G(x)  =  53  CiG(X  “  Xi)  • 

i= 1  i  —  1 

We  now  remember  that  we  had  defined  as  equivalent  all 
the  functions  differing  by  a  term  that  lies  in  the  null 
space  of  <j>[f],  and  therefore  the  most  general  solution  of 
the  minimization  problem  is 


H[f]  =  53(tt-C  /  ds  /(s)eiXi's)2+A  f 

J  Rd  JR‘ 


ds 


/(-s)/(s) 


/Rd  G(  s) 

In  order  to  find  the  minimum  of  this  functional  we  take 
its  functional  derivatives  with  respect  to  /: 


8SQ  =  0  Vt  €  Rd  ■  (27) 

6f(t) 

We  now  proceed  to  compute  the  functional  derivatives 
of  the  first  and  second  term  of  H[f].  For  the  first  term 
we  have: 


6f{ *)  S 


E(w  -  C  [  ds  /(s)eiXi  S)2 

JRd 


*/(  s) 


,iXj-S 


«  r 
=  2  V(W  “/(*<))  /  ds 
tC  ”JRd  6f(  t) 

■£  /• 

2T(yi-f(xi))  ds  tf(s  -  t)eiXi  S 


/i?<i 

JV 


=  2  53(»-/(xi))e 


i=l 


N 

/(x)  =  53  CiG(X  -  *i)  +  p(x) 

1=1 

where  p(x)  is  a  term  that  lies  in  the  null  space  of  <j>[f]. 

B  Approximation  of  vector  fields 

through  multioutput  regularization 
networks 

Consider  the  problem  of  approximating  a  vector  field 
y(x)  from  a  set  of  sparse  data,  the  examples,  which  are 
pairs  ( yi ,  Xj)  for  i  =  1  •  •  •  N.  Choose  a  Generalized  Reg¬ 
ularization  Network  as  the  approximation  scheme,  that 
is,  a  network  with  one  “hidden”  layer  and  linear  output 
units.  Consider  the  case  of  N  examples,  n  <  N  centers, 
input  dimensionality  d  and  output  dimensionality  q  (see 
figure  17).  Then  the  approximation  is 

n 

y(x)  =  CiG(x  _  XJ) 

i= 1 

with  G  being  the  chosen  Green  function.  The  equation 
can  be  rewritten  in  matrix  notation  as 
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y(x)  =  Cg(x) 


where  g  is  the  vector  with  elements  gi  =  G(x  —  x*). 

Let  us  define  as  G  the  matrix  of  the  chosen  Green  func¬ 
tion  evaluated  at  the  examples,  that  is,  the  matrix  with 
elements  Gitj  =  G(xt-  —  Xj).  Then  the  “weights”  c  are 
“learned”  from  the  examples  by  solving 

Y  =  CG 

where  Y  is  defined  as  the  matrix  in  which  column  l  is 
the  example  y j.  C  is  defined  as  the  matrix  in  which  row 
m  is  the  vector  cm.  This  means  that  x  is  a  d  x  1  matrix, 
C  is  a  q  x  n  matrix,  Y  is  a  q  x  N  matrix  and  G  is  a 
n  x  N  matrix.  Then  the  set  of  weights  C  is  given  by 

C  =  YG+ 

It  also  follows  (though  it  is  not  so  well  known)  that 
the  vector  field  y  is  approximated  by  the  network  as  the 
linear  combination  of  the  example  fields  y ;  ,  that  is 

y(x)  =  YG+g(x) 

which  can  be  rewritten  as 

N 

y(x)  =  X!6'(x)yi 
!  =  1 

where  the  6;  depend  on  the  chosen  G,  according  to 

b(x)  =  G+g(x) 

Thus  for  any  choice  of  the  regularization  network  - 
even  HBF  -  and  any  choice  of  the  Green  function  -  in¬ 
cluding  Green  functions  corresponding  to  additive  splines 
and  tensor  product  splines  -  the  estimated  output  vec¬ 
tor  is  always  a  linear  combination  of  example  vectors 
with  coefficients  b  that  depend  (nonlinearly)  on  the  in¬ 
put  value.  The  result  is  valid  for  all  networks  with  one 
hidden  layer  and  linear  outputs,  provided  that  a  L 2  cri¬ 
terion  is  used  for  training.  Thus,  for  all  types  of  regu¬ 
larization  networks  and  all  Green  functions  the  output 
is  always  a  linear  combination  of  output  examples  (see 
Poggio  and  Girosi  1989). 
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1  basis 
function 

2  basis 
functions 

4  basis 
functions 

8  basis 
functions 

16  basis 
functions 

Absolute  value 

train:  0.798076 
test:  0.762225 

0.160382 

0.127020 

0.011687 

0.012427 

0.000555 

0.001179 

0.000056 

0.000144 

“Sigmoidal” 

train:  0.161108 
test:  0.128057 

0.131835 

0.106780 

0.001599 

0.001972 

0.000427 

0.000787 

0.000037 

0.000163 

“Gaussian” 

train:  0.497329 
test:  0.546142 

0.072549 

0.087254 

0.002880 

0.003820 

0.000524 

0.001211 

0.000024 

0.000306 

Table  1:  L 2  training  and  test  error  for  each  of  the  3  piece  wise  linear  models  using  different  numbers  of  basis  functions. 


Model  1 

(7i  =  cr2  =  0.5 

Model  2 

ft*-*;)*  ,  (V-Vi)2  \  /(*-*, )2  ,  (y-»,)2\ 

f(?,y)  =  Z2ii1ci[e  y  ^  )  +  e  1  '»  /] 

crj  =  10,  a2  =  0.5 

Model  3 

RSlBSlfflS  EFSS31;- •  /  v.* 

cr  -  0.5 

Model  4 

f(x,y)  =  Zl=ibQe-i  +ELi<*« - 

a  =  0.5 

Model  5 

/(*,»)  =  S:=.c.e-(w-—‘->‘ 

- 

Model  6 

f{x,y)  =  £;=i  Ci[<r(x  -  Xi)  +  <r(y  -  Vi)\ 

- 

Model  7 

f(x,y)  =  E«=1  -  t°)  +  ELic/3 <r(y  -  tv) 

- 

Model  8 

f{x,y)  =  Ea= lCaO-(wa  -  x-ta) 

- 

Table  2:  The  eight  models  we  tested  in  our  numerical  experiments. 


Model  1 

Model  2 

Model  3 

Model  4 

Model  5 

Model  6 

Model  7 

Model  8 

h(x, y) 

train:  0.000036 
test:  0.011717 

0.000067 

0.001598 

0.000001 

0.000007 

0.000001 

0.000009 

0.000170 

0.001422 

0.000001 

0.000015 

0.000003 

0.000020 

0.000743 

0.026699 

d(x,y) 

train:  0.000000 
test:  0.003818 

0.000000 

0.344881 

0.000000 

67.95237 

0.345423 

1.222111 

0.000001 

0.033964 

0.000000 

98.419816 

0.456822 

1.397397 

0.000044 

0.191055 

Table  3:  A  summary  of  the  results  of  our  numerical  experiments.  Each  table  entry  contains  the  L2  errors  for  both 
the  training  set  and  the  test  set. 
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G(r)  =  rA2  ln(r) 
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z  =  exp(-  xA2  -  yA2) 


Figure  2:  The  Gaussian  basis  function  G(r )  =  e  r  ,  where  r 
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Figure  3:  The  basis  function  G(x)  =  e 


z  =  exp(-xA2) 


z  =  exp(-yA2) 


a) 


b) 


z  =  exp(-xA2)  +  exp(-yA2) 


C) 

Figure  4:  In  (c)  it  is  shown  an  additive  basis  function,  in  the  case  in  which  the  additive  component  of  the  basis 
functions  (a  and  b)  are  gaussian. 
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-0.4-0. 2 


0.2  0.4 


-0.4-0. 2 


0.2  0.4 


Figure  5:  a)  Absolute  value  basis  function,  |z|,  b)  “Sigmoidal”  basis  function  ctl(x)  c)  Gaussian-like  basis  function 
9l(*) 
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Figure  11:  a)  RBF  Gaussian  model  approximation  of  h(x,y)  (model  1).  b)  Elliptical  Gaussian  model  approximation 
of  h(x,  y)  (model  2).  c)  Additive  Gaussian  model  approximation  of  h(x,  y)  (model  3). 


Figure  12:  a)  RBF  Gaussian  model  approximation  of  g(x,  y)  (model  1).  b)  Elliptical  Gaussian  model  approximation 
oig(x,y)  (model  2). 
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Networks  (GRN)  Regularization  networks  (RN) 


X  Y 


f 


Figure  16:  The  most  general  network  with  one  hidden  layer  and  scalar  output. 
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Figure  17:  The  most  general  network  with  one  hidden  layer  and  vector  output. 
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